pip install pandas
Requirement already satisfied: pandas in ./lib/python3.10/site-packages (1.5.2)
Requirement already satisfied: python-dateutil>=2.8.1 in ./lib/python3.10/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in ./lib/python3.10/site-packages (from pandas) (2022.7)
Requirement already satisfied: numpy>=1.21.0 in ./lib/python3.10/site-packages (from pandas) (1.24.1)
Requirement already satisfied: six>=1.5 in ./lib/python3.10/site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
pip install numpy
Requirement already satisfied: numpy in ./lib/python3.10/site-packages (1.24.1)
Note: you may need to restart the kernel to use updated packages.
pip install plotly
Collecting plotly
Downloading plotly-5.12.0-py2.py3-none-any.whl (15.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.2/15.2 MB 2.9 MB/s eta 0:00:00m eta 0:00:01[36m0:00:01
Collecting tenacity>=6.2.0
Downloading tenacity-8.1.0-py3-none-any.whl (23 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.12.0 tenacity-8.1.0
Note: you may need to restart the kernel to use updated packages.
pip install seaborn
Requirement already satisfied: seaborn in ./lib/python3.10/site-packages (0.12.2)
Requirement already satisfied: pandas>=0.25 in ./lib/python3.10/site-packages (from seaborn) (1.5.2)
Requirement already satisfied: numpy!=1.24.0,>=1.17 in ./lib/python3.10/site-packages (from seaborn) (1.24.1)
Requirement already satisfied: matplotlib!=3.6.1,>=3.1 in ./lib/python3.10/site-packages (from seaborn) (3.6.2)
Requirement already satisfied: cycler>=0.10 in ./lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in ./lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (4.38.0)
Requirement already satisfied: packaging>=20.0 in ./lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (21.3)
Requirement already satisfied: python-dateutil>=2.7 in ./lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (2.8.2)
Requirement already satisfied: pyparsing>=2.2.1 in ./lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (3.0.9)
Requirement already satisfied: contourpy>=1.0.1 in ./lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.0.6)
Requirement already satisfied: pillow>=6.2.0 in ./lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (9.4.0)
Requirement already satisfied: kiwisolver>=1.0.1 in ./lib/python3.10/site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.4.4)
Requirement already satisfied: pytz>=2020.1 in ./lib/python3.10/site-packages (from pandas>=0.25->seaborn) (2022.7)
Requirement already satisfied: six>=1.5 in ./lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.1->seaborn) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
pip install matplotlib
Requirement already satisfied: matplotlib in ./lib/python3.10/site-packages (3.6.2)
Requirement already satisfied: pyparsing>=2.2.1 in ./lib/python3.10/site-packages (from matplotlib) (3.0.9)
Requirement already satisfied: packaging>=20.0 in ./lib/python3.10/site-packages (from matplotlib) (21.3)
Requirement already satisfied: python-dateutil>=2.7 in ./lib/python3.10/site-packages (from matplotlib) (2.8.2)
Requirement already satisfied: numpy>=1.19 in ./lib/python3.10/site-packages (from matplotlib) (1.24.1)
Requirement already satisfied: cycler>=0.10 in ./lib/python3.10/site-packages (from matplotlib) (0.11.0)
Requirement already satisfied: contourpy>=1.0.1 in ./lib/python3.10/site-packages (from matplotlib) (1.0.6)
Requirement already satisfied: pillow>=6.2.0 in ./lib/python3.10/site-packages (from matplotlib) (9.4.0)
Requirement already satisfied: kiwisolver>=1.0.1 in ./lib/python3.10/site-packages (from matplotlib) (1.4.4)
Requirement already satisfied: fonttools>=4.22.0 in ./lib/python3.10/site-packages (from matplotlib) (4.38.0)
Requirement already satisfied: six>=1.5 in ./lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
pip install scikit.learn
Collecting scikit.learn
Downloading scikit_learn-1.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.5/9.5 MB 3.3 MB/s eta 0:00:00m eta 0:00:010:01:010m
Requirement already satisfied: numpy>=1.17.3 in ./lib/python3.10/site-packages (from scikit.learn) (1.24.1)
Collecting threadpoolctl>=2.0.0
Downloading threadpoolctl-3.1.0-py3-none-any.whl (14 kB)
Collecting scipy>=1.3.2
Downloading scipy-1.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34.4/34.4 MB 2.6 MB/s eta 0:00:00m eta 0:00:01[36m0:00:01
Collecting joblib>=1.1.1
Downloading joblib-1.2.0-py3-none-any.whl (297 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 298.0/298.0 kB 2.1 MB/s eta 0:00:00m eta 0:00:01[36m0:00:01
Installing collected packages: threadpoolctl, scipy, joblib, scikit.learn
Successfully installed joblib-1.2.0 scikit.learn-1.2.0 scipy-1.10.0 threadpoolctl-3.1.0
Note: you may need to restart the kernel to use updated packages.
pip install yellowbrick
Collecting yellowbrick
Downloading yellowbrick-1.5-py3-none-any.whl (282 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 282.6/282.6 kB 2.5 MB/s eta 0:00:00 MB/s eta 0:00:01:01
Requirement already satisfied: cycler>=0.10.0 in ./lib/python3.10/site-packages (from yellowbrick) (0.11.0)
Requirement already satisfied: scipy>=1.0.0 in ./lib/python3.10/site-packages (from yellowbrick) (1.10.0)
Requirement already satisfied: scikit-learn>=1.0.0 in ./lib/python3.10/site-packages (from yellowbrick) (1.2.0)
Requirement already satisfied: matplotlib!=3.0.0,>=2.0.2 in ./lib/python3.10/site-packages (from yellowbrick) (3.6.2)
Requirement already satisfied: numpy>=1.16.0 in ./lib/python3.10/site-packages (from yellowbrick) (1.24.1)
Requirement already satisfied: kiwisolver>=1.0.1 in ./lib/python3.10/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (1.4.4)
Requirement already satisfied: pyparsing>=2.2.1 in ./lib/python3.10/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in ./lib/python3.10/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (2.8.2)
Requirement already satisfied: packaging>=20.0 in ./lib/python3.10/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (21.3)
Requirement already satisfied: contourpy>=1.0.1 in ./lib/python3.10/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (1.0.6)
Requirement already satisfied: pillow>=6.2.0 in ./lib/python3.10/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (9.4.0)
Requirement already satisfied: fonttools>=4.22.0 in ./lib/python3.10/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (4.38.0)
Requirement already satisfied: joblib>=1.1.1 in ./lib/python3.10/site-packages (from scikit-learn>=1.0.0->yellowbrick) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in ./lib/python3.10/site-packages (from scikit-learn>=1.0.0->yellowbrick) (3.1.0)
Requirement already satisfied: six>=1.5 in ./lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib!=3.0.0,>=2.0.2->yellowbrick) (1.16.0)
Installing collected packages: yellowbrick
Successfully installed yellowbrick-1.5
Note: you may need to restart the kernel to use updated packages.
pip install xgboost
Collecting xgboost
Downloading xgboost-1.7.3-py3-none-manylinux2014_x86_64.whl (193.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 193.6/193.6 MB 951.3 kB/s eta 0:00:00m eta 0:00:01[36m0:00:02
Requirement already satisfied: scipy in ./lib/python3.10/site-packages (from xgboost) (1.10.0)
Requirement already satisfied: numpy in ./lib/python3.10/site-packages (from xgboost) (1.24.1)
Installing collected packages: xgboost
Successfully installed xgboost-1.7.3
Note: you may need to restart the kernel to use updated packages.
pip install catboost
Requirement already satisfied: catboost in ./lib/python3.10/site-packages (1.1.1)
Requirement already satisfied: matplotlib in ./lib/python3.10/site-packages (from catboost) (3.6.2)
Requirement already satisfied: pandas>=0.24.0 in ./lib/python3.10/site-packages (from catboost) (1.5.2)
Requirement already satisfied: scipy in ./lib/python3.10/site-packages (from catboost) (1.10.0)
Requirement already satisfied: graphviz in ./lib/python3.10/site-packages (from catboost) (0.20.1)
Requirement already satisfied: plotly in ./lib/python3.10/site-packages (from catboost) (5.12.0)
Requirement already satisfied: six in ./lib/python3.10/site-packages (from catboost) (1.16.0)
Requirement already satisfied: numpy>=1.16.0 in ./lib/python3.10/site-packages (from catboost) (1.24.1)
Requirement already satisfied: python-dateutil>=2.8.1 in ./lib/python3.10/site-packages (from pandas>=0.24.0->catboost) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in ./lib/python3.10/site-packages (from pandas>=0.24.0->catboost) (2022.7)
Requirement already satisfied: packaging>=20.0 in ./lib/python3.10/site-packages (from matplotlib->catboost) (21.3)
Requirement already satisfied: pillow>=6.2.0 in ./lib/python3.10/site-packages (from matplotlib->catboost) (9.4.0)
Requirement already satisfied: fonttools>=4.22.0 in ./lib/python3.10/site-packages (from matplotlib->catboost) (4.38.0)
Requirement already satisfied: kiwisolver>=1.0.1 in ./lib/python3.10/site-packages (from matplotlib->catboost) (1.4.4)
Requirement already satisfied: contourpy>=1.0.1 in ./lib/python3.10/site-packages (from matplotlib->catboost) (1.0.6)
Requirement already satisfied: pyparsing>=2.2.1 in ./lib/python3.10/site-packages (from matplotlib->catboost) (3.0.9)
Requirement already satisfied: cycler>=0.10 in ./lib/python3.10/site-packages (from matplotlib->catboost) (0.11.0)
Requirement already satisfied: tenacity>=6.2.0 in ./lib/python3.10/site-packages (from plotly->catboost) (8.1.0)
Note: you may need to restart the kernel to use updated packages.
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from yellowbrick.classifier import ConfusionMatrix
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
credit_pred = pd.read_csv ("/home/hdoop/Downloads/credit_risk_dataset.csv")
credit_pred
| | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 59000 | RENT | 123.0 | PERSONAL | D | 35000 | 16.02 | 1 | 0.59 | Y | 3 |
| 1 | 21 | 9600 | OWN | 5.0 | EDUCATION | B | 1000 | 11.14 | 0 | 0.10 | N | 2 |
| 2 | 25 | 9600 | MORTGAGE | 1.0 | MEDICAL | C | 5500 | 12.87 | 1 | 0.57 | N | 3 |
| 3 | 23 | 65500 | RENT | 4.0 | MEDICAL | C | 35000 | 15.23 | 1 | 0.53 | N | 2 |
| 4 | 24 | 54400 | RENT | 8.0 | MEDICAL | C | 35000 | 14.27 | 1 | 0.55 | Y | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32576 | 57 | 53000 | MORTGAGE | 1.0 | PERSONAL | C | 5800 | 13.16 | 0 | 0.11 | N | 30 |
| 32577 | 54 | 120000 | MORTGAGE | 4.0 | PERSONAL | A | 17625 | 7.49 | 0 | 0.15 | N | 19 |
| 32578 | 65 | 76000 | RENT | 3.0 | HOMEIMPROVEMENT | B | 35000 | 10.99 | 1 | 0.46 | N | 28 |
| 32579 | 56 | 150000 | MORTGAGE | 5.0 | PERSONAL | B | 15000 | 11.48 | 0 | 0.10 | N | 26 |
| 32580 | 66 | 42000 | RENT | 2.0 | MEDICAL | B | 6475 | 9.99 | 0 | 0.15 | N | 30 |
32581 rows × 12 columns
credit_pred.head()
| | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 59000 | RENT | 123.0 | PERSONAL | D | 35000 | 16.02 | 1 | 0.59 | Y | 3 |
| 1 | 21 | 9600 | OWN | 5.0 | EDUCATION | B | 1000 | 11.14 | 0 | 0.10 | N | 2 |
| 2 | 25 | 9600 | MORTGAGE | 1.0 | MEDICAL | C | 5500 | 12.87 | 1 | 0.57 | N | 3 |
| 3 | 23 | 65500 | RENT | 4.0 | MEDICAL | C | 35000 | 15.23 | 1 | 0.53 | N | 2 |
| 4 | 24 | 54400 | RENT | 8.0 | MEDICAL | C | 35000 | 14.27 | 1 | 0.55 | Y | 4 |
credit_pred.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32581 entries, 0 to 32580
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   person_age                  32581 non-null  int64
 1   person_income               32581 non-null  int64
 2   person_home_ownership       32581 non-null  object
 3   person_emp_length           31686 non-null  float64
 4   loan_intent                 32581 non-null  object
 5   loan_grade                  32581 non-null  object
 6   loan_amnt                   32581 non-null  int64
 7   loan_int_rate               29465 non-null  float64
 8   loan_status                 32581 non-null  int64
 9   loan_percent_income         32581 non-null  float64
 10  cb_person_default_on_file   32581 non-null  object
 11  cb_person_cred_hist_length  32581 non-null  int64
dtypes: float64(3), int64(5), object(4)
memory usage: 3.0+ MB
credit_pred.isnull().sum()
person_age                       0
person_income                    0
person_home_ownership            0
person_emp_length              895
loan_intent                      0
loan_grade                       0
loan_amnt                        0
loan_int_rate                 3116
loan_status                      0
loan_percent_income              0
cb_person_default_on_file        0
cb_person_cred_hist_length       0
dtype: int64
credit_pred = credit_pred.dropna()
credit_pred.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 28638 entries, 0 to 32580
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   person_age                  28638 non-null  int64
 1   person_income               28638 non-null  int64
 2   person_home_ownership       28638 non-null  object
 3   person_emp_length           28638 non-null  float64
 4   loan_intent                 28638 non-null  object
 5   loan_grade                  28638 non-null  object
 6   loan_amnt                   28638 non-null  int64
 7   loan_int_rate               28638 non-null  float64
 8   loan_status                 28638 non-null  int64
 9   loan_percent_income         28638 non-null  float64
 10  cb_person_default_on_file   28638 non-null  object
 11  cb_person_cred_hist_length  28638 non-null  int64
dtypes: float64(3), int64(5), object(4)
memory usage: 2.8+ MB
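The row counts and null counts printed so far are enough to work out how many rows `dropna()` removed and how many rows were missing both `person_emp_length` and `loan_int_rate`. The following is a small arithmetic sketch that uses only numbers copied from the outputs; the variable names are illustrative and not part of the notebook.

```python
# Counts copied from the info() and isnull().sum() outputs.
rows_before = 32581
rows_after = 28638
missing_emp_length = 895
missing_int_rate = 3116

rows_dropped = rows_before - rows_after

# Inclusion-exclusion: a row missing both values is counted twice in the sum
# of the two null counts, but it is dropped only once.
missing_both = missing_emp_length + missing_int_rate - rows_dropped

share_dropped = rows_dropped / rows_before

print(rows_dropped)             # 3943
print(missing_both)             # 68
print(round(share_dropped, 3))  # 0.121
```

Roughly 12% of the rows are discarded by this step, which is worth keeping in mind when reading the model scores further down.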
sns.heatmap(credit_pred.isnull())
<AxesSubplot: >
pred = credit_pred.describe()
pred.style.background_gradient (cmap = 'PuBu')
| | person_age | person_income | person_emp_length | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|
| count | 28638.000000 | 28638.000000 | 28638.000000 | 28638.000000 | 28638.000000 | 28638.000000 | 28638.000000 | 28638.000000 |
| mean | 27.727216 | 66649.371884 | 4.788672 | 9656.493121 | 11.039867 | 0.216600 | 0.169488 | 5.793736 |
| std | 6.310441 | 62356.447405 | 4.154627 | 6329.683361 | 3.229372 | 0.411935 | 0.106393 | 4.038483 |
| min | 20.000000 | 4000.000000 | 0.000000 | 500.000000 | 5.420000 | 0.000000 | 0.000000 | 2.000000 |
| 25% | 23.000000 | 39480.000000 | 2.000000 | 5000.000000 | 7.900000 | 0.000000 | 0.090000 | 3.000000 |
| 50% | 26.000000 | 55956.000000 | 4.000000 | 8000.000000 | 10.990000 | 0.000000 | 0.150000 | 4.000000 |
| 75% | 30.000000 | 80000.000000 | 7.000000 | 12500.000000 | 13.480000 | 0.000000 | 0.230000 | 8.000000 |
| max | 144.000000 | 6000000.000000 | 123.000000 | 35000.000000 | 23.220000 | 1.000000 | 0.830000 | 30.000000 |
corr = credit_pred.corr(numeric_only=True)
f, ax = plt.subplots(figsize = (25,25))
sns.heatmap(corr, annot= True)
corr.round(0)
| | person_age | person_income | person_emp_length | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|
| person_age | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.0 | -0.0 | 1.0 |
| person_income | 0.0 | 1.0 | 0.0 | 0.0 | -0.0 | -0.0 | -0.0 | 0.0 |
| person_emp_length | 0.0 | 0.0 | 1.0 | 0.0 | -0.0 | -0.0 | -0.0 | 0.0 |
| loan_amnt | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| loan_int_rate | 0.0 | -0.0 | -0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| loan_status | -0.0 | -0.0 | -0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.0 |
| loan_percent_income | -0.0 | -0.0 | -0.0 | 1.0 | 0.0 | 0.0 | 1.0 | -0.0 |
| cb_person_cred_hist_length | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.0 | -0.0 | 1.0 |
plt.figure(figsize = [40,20])
sns.countplot(x= 'loan_percent_income', hue= 'loan_status', data= credit_pred);
plt.figure(figsize = [25,15])
sns.countplot(x= 'person_age', hue= 'loan_status', data= credit_pred);
defaulter = credit_pred [credit_pred['loan_status'] == 1]
non_defaulter = credit_pred [credit_pred ['loan_status'] == 0]
fig_A = px.histogram(defaulter, x = 'loan_intent', color = 'loan_intent', template = 'plotly_dark')
fig_A.show()
fig_A = px.histogram(defaulter, x = 'person_age', color = 'person_age', template = 'plotly_dark')
fig_A.show()
fig_A = px.histogram(defaulter, x = 'person_income', color = 'person_income', template = 'plotly_dark')
fig_A.show()
fig_A = px.histogram(defaulter, x = 'person_home_ownership', color = 'person_home_ownership', template = 'plotly_dark')
fig_A.show()
fig_A = px.histogram(defaulter, x = 'cb_person_default_on_file', color = 'cb_person_default_on_file', template = 'plotly_dark')
fig_A.show()
fig_A = px.histogram(defaulter, x = 'loan_amnt', color = 'loan_amnt', template = 'plotly_dark')
fig_A.show()
fig_A = px.histogram(non_defaulter, x = 'loan_intent', color = 'loan_intent', template = 'plotly_dark')
fig_A.show()
fig_A = px.histogram(non_defaulter, x = 'person_age', color = 'person_age', template = 'plotly_dark')
fig_A.show()
fig_A = px.histogram(non_defaulter, x = 'person_income', color = 'person_income', template = 'plotly_dark')
fig_A.show()
fig_A = px.histogram(non_defaulter, x = 'person_home_ownership', color = 'person_home_ownership', template = 'plotly_dark')
fig_A.show()
fig_A = px.histogram(non_defaulter, x = 'cb_person_default_on_file', color = 'cb_person_default_on_file', template = 'plotly_dark')
fig_A.show()
fig_A = px.histogram(non_defaulter, x = 'loan_amnt', color = 'loan_amnt', template = 'plotly_dark')
fig_A.show()
grafico = px.scatter_matrix(credit_pred, dimensions=['person_age', 'person_income', 'cb_person_cred_hist_length', 'loan_amnt'], color = 'loan_status')
grafico.show()
grafico = px.parallel_categories (credit_pred, dimensions = ['loan_intent', 'loan_grade', 'loan_status'])
grafico.show()
X_pred = credit_pred.drop (columns = ['loan_status'])
X_pred
| | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 59000 | RENT | 123.0 | PERSONAL | D | 35000 | 16.02 | 0.59 | Y | 3 |
| 1 | 21 | 9600 | OWN | 5.0 | EDUCATION | B | 1000 | 11.14 | 0.10 | N | 2 |
| 2 | 25 | 9600 | MORTGAGE | 1.0 | MEDICAL | C | 5500 | 12.87 | 0.57 | N | 3 |
| 3 | 23 | 65500 | RENT | 4.0 | MEDICAL | C | 35000 | 15.23 | 0.53 | N | 2 |
| 4 | 24 | 54400 | RENT | 8.0 | MEDICAL | C | 35000 | 14.27 | 0.55 | Y | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32576 | 57 | 53000 | MORTGAGE | 1.0 | PERSONAL | C | 5800 | 13.16 | 0.11 | N | 30 |
| 32577 | 54 | 120000 | MORTGAGE | 4.0 | PERSONAL | A | 17625 | 7.49 | 0.15 | N | 19 |
| 32578 | 65 | 76000 | RENT | 3.0 | HOMEIMPROVEMENT | B | 35000 | 10.99 | 0.46 | N | 28 |
| 32579 | 56 | 150000 | MORTGAGE | 5.0 | PERSONAL | B | 15000 | 11.48 | 0.10 | N | 26 |
| 32580 | 66 | 42000 | RENT | 2.0 | MEDICAL | B | 6475 | 9.99 | 0.15 | N | 30 |
28638 rows × 11 columns
X_pred.values
array([[22, 59000, 'RENT', ..., 0.59, 'Y', 3],
[21, 9600, 'OWN', ..., 0.1, 'N', 2],
[25, 9600, 'MORTGAGE', ..., 0.57, 'N', 3],
...,
[65, 76000, 'RENT', ..., 0.46, 'N', 28],
[56, 150000, 'MORTGAGE', ..., 0.1, 'N', 26],
[66, 42000, 'RENT', ..., 0.15, 'N', 30]], dtype=object)
X_pred = X_pred.values
type (X_pred)
numpy.ndarray
y_pred = credit_pred.iloc [:,8].values
y_pred
array([1, 0, 1, ..., 1, 0, 0])
label_encoder_teste = LabelEncoder()
X_pred [4]
array([24, 54400, 'RENT', 8.0, 'MEDICAL', 'C', 35000, 14.27, 0.55, 'Y', 4],
dtype=object)
label_encoder_person_home_ownership = LabelEncoder()
label_encoder_loan_intent = LabelEncoder()
label_encoder_loan_grade = LabelEncoder()
label_encoder_cb_person_default_on_file = LabelEncoder()
X_pred [:, 2] = label_encoder_person_home_ownership.fit_transform (X_pred [:,2])
X_pred [:, 4] = label_encoder_loan_intent.fit_transform (X_pred [:,4])
X_pred [:, 5] = label_encoder_loan_grade.fit_transform (X_pred [:,5])
X_pred [:, 9] = label_encoder_cb_person_default_on_file.fit_transform (X_pred [:,9])
X_pred [4]
array([24, 54400, 3, 8.0, 3, 2, 35000, 14.27, 0.55, 1, 4], dtype=object)
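`LabelEncoder` assigns integer codes according to the sorted order of the distinct labels it sees during `fit`. The sketch below imitates that behaviour in plain Python; `fit_label_codes` is a hypothetical helper written for illustration, not scikit-learn code, and the sample contains only the home-ownership values visible in the five rows printed earlier.

```python
def fit_label_codes(values):
    """Map each distinct value to its position in sorted order."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


# Home-ownership values of rows 0-4 as shown in the head() output.
sample = ["RENT", "OWN", "MORTGAGE", "RENT", "RENT"]
codes = fit_label_codes(sample)
encoded = [codes[value] for value in sample]

print(codes)    # {'MORTGAGE': 0, 'OWN': 1, 'RENT': 2}
print(encoded)  # [2, 1, 0, 2, 2]
```

On this small sample `OWN` is coded 1, whereas the notebook output shows `OWN` as 2 and `RENT` as 3. That difference implies the full column holds one more category that sorts between `MORTGAGE` and `OWN`. The codes carry no numeric meaning beyond alphabetical order.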
X_pred
array([[22, 59000, 3, ..., 0.59, 1, 3],
[21, 9600, 2, ..., 0.1, 0, 2],
[25, 9600, 0, ..., 0.57, 0, 3],
...,
[65, 76000, 3, ..., 0.46, 0, 28],
[56, 150000, 0, ..., 0.1, 0, 26],
[66, 42000, 3, ..., 0.15, 0, 30]], dtype=object)
onehotencoder_pred = ColumnTransformer(transformers= [('OneHot', OneHotEncoder(), [0,2,4,5,9,10])], remainder = 'passthrough')
X_pred = onehotencoder_pred.fit_transform (X_pred).toarray()
X_pred
array([[0.000e+00, 0.000e+00, 1.000e+00, ..., 3.500e+04, 1.602e+01,
5.900e-01],
[0.000e+00, 1.000e+00, 0.000e+00, ..., 1.000e+03, 1.114e+01,
1.000e-01],
[0.000e+00, 0.000e+00, 0.000e+00, ..., 5.500e+03, 1.287e+01,
5.700e-01],
...,
[0.000e+00, 0.000e+00, 0.000e+00, ..., 3.500e+04, 1.099e+01,
4.600e-01],
[0.000e+00, 0.000e+00, 0.000e+00, ..., 1.500e+04, 1.148e+01,
1.000e-01],
[0.000e+00, 0.000e+00, 0.000e+00, ..., 6.475e+03, 9.990e+00,
1.500e-01]])
X_pred.shape
(28638, 110)
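The jump from 11 to 110 columns follows from how one-hot encoding works: each of the six encoded columns is replaced by one 0/1 indicator column per distinct value, and the five remaining columns are passed through and placed last. That leaves 110 - 5 = 105 indicator columns. A minimal plain-Python sketch of the idea follows; `one_hot` is a hypothetical helper, not the scikit-learn implementation, and the grade codes assume the alphabetical mapping A=0, B=1, C=2, D=3.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 column per distinct value."""
    categories = sorted(set(values))
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows


# Label-encoded loan grades of rows 0-4 (D, B, C, C, C), assuming A=0 ... D=3.
grades = [3, 1, 2, 2, 2]
categories, rows = one_hot(grades)

print(categories)  # [1, 2, 3]
print(rows[0])     # [0, 0, 1]
print(rows[1])     # [1, 0, 0]
```

Because columns 0 (`person_age`) and 10 (`cb_person_cred_hist_length`) are in the encoded list, every distinct age and credit-history length also becomes its own indicator column, which accounts for most of the 105.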
scaler_pred = StandardScaler()
X_pred = scaler_pred.fit_transform(X_pred)
X_pred [0]
array([-2.21156066e-02, -1.96148135e-01, 2.83796804e+00, -3.67834582e-01,
-3.50295021e-01, -3.22636605e-01, -2.88538610e-01, -2.65592581e-01,
-2.45187635e-01, -2.34522752e-01, -2.02305697e-01, -1.91002839e-01,
-1.75953836e-01, -1.64887684e-01, -1.49368905e-01, -1.41096125e-01,
-1.30839248e-01, -1.20368793e-01, -1.06470402e-01, -9.75590411e-02,
-9.11552192e-02, -8.75822726e-02, -7.61246586e-02, -7.05914692e-02,
-6.37733171e-02, -5.64599460e-02, -5.48821300e-02, -5.15836998e-02,
-4.76956485e-02, -3.78644533e-02, -3.96712966e-02, -3.39653422e-02,
-3.39653422e-02, -3.01448110e-02, -2.70892883e-02, -2.50784931e-02,
-2.21156066e-02, -2.28922276e-02, -2.43714887e-02, -1.32145256e-02,
-2.13107595e-02, -1.67160753e-02, -1.44760403e-02, -1.02355700e-02,
-1.56361836e-02, -1.32145256e-02, -1.67160753e-02, -5.90930274e-03,
-1.32145256e-02, -1.32145256e-02, -8.35716200e-03, -5.90930274e-03,
-5.90930274e-03, -5.90930274e-03, -5.90930274e-03, -5.90930274e-03,
-1.02355700e-02, -8.37195816e-01, -5.73860735e-02, -2.87899081e-01,
9.83926906e-01, -4.35467034e-01, -4.98712041e-01, -3.54552601e-01,
-4.76161204e-01, 2.20727264e+00, -4.59972905e-01, -6.99121631e-01,
-6.85270103e-01, -4.98439082e-01, 2.79591098e+00, -1.77005730e-01,
-8.57417516e-02, -4.54362512e-02, -2.14755511e+00, 2.14755511e+00,
-4.72349845e-01, 2.11484701e+00, -4.72571017e-01, -2.48055676e-01,
-2.46704525e-01, -2.49085105e-01, -2.48451997e-01, -2.48372772e-01,
-2.46066714e-01, -1.20517988e-01, -1.22881884e-01, -1.17041147e-01,
-1.24337832e-01, -1.14407190e-01, -1.17041147e-01, -1.10749198e-01,
-2.28922276e-02, -2.50784931e-02, -3.18381379e-02, -2.50784931e-02,
-2.70892883e-02, -2.57661525e-02, -3.07195863e-02, -2.43714887e-02,
-2.36434040e-02, -2.50784931e-02, -2.77272550e-02, -1.96023628e-02,
-2.57661525e-02, -1.22673849e-01, 2.84534330e+01, 4.00398376e+00,
1.54216384e+00, 3.95252678e+00])
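The last five entries of the scaled row are the passthrough numeric columns in their original order (`person_income`, `person_emp_length`, `loan_amnt`, `loan_int_rate`, `loan_percent_income`). Two of them can be checked by hand with the z-score formula, using the mean and standard deviation from the `describe()` table. This is a sketch with a hypothetical `standardize` helper, not a call into scikit-learn.

```python
def standardize(value, mean, std):
    """z-score: distance from the mean in units of standard deviation."""
    return (value - mean) / std


# Row 0 values and the column statistics from the describe() output.
z_emp_length = standardize(123.0, 4.788672, 4.154627)
z_income = standardize(59000, 66649.371884, 62356.447405)

print(round(z_emp_length, 2))  # 28.45
print(round(z_income, 2))      # -0.12
```

These agree with the `2.84534330e+01` and `-1.22673849e-01` entries in the scaled row. The small difference in later decimals arises because `describe()` reports the sample standard deviation while `StandardScaler` divides by the population standard deviation. A value more than 28 standard deviations from the mean also flags the 123-year employment length in row 0 as a likely data-entry error.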
X_pred_train, X_pred_test, y_pred_train, y_pred_test = train_test_split (X_pred, y_pred, test_size =0.30, random_state= 0)
X_pred_train.shape
(20046, 110)
X_pred_test.shape
(8592, 110)
X_pred_train.shape , y_pred_train.shape
((20046, 110), (20046,))
X_pred_test.shape, y_pred_test.shape
((8592, 110), (8592,))
naive_pred = GaussianNB()
naive_pred.fit(X_pred_train, y_pred_train)
predictions = naive_pred.predict (X_pred_test)
predictions
array([1, 1, 1, ..., 1, 1, 1])
y_pred_test
array([0, 0, 0, ..., 0, 0, 0])
y_pred_train
array([0, 0, 0, ..., 0, 0, 0])
accuracy_score (y_pred_test, predictions)
0.21752793296089384
confusion_matrix (y_pred_test, predictions)
array([[ 67, 6702],
[ 21, 1802]])
con_mat = ConfusionMatrix (naive_pred)
con_mat.fit (X_pred_train, y_pred_train)
con_mat.score (X_pred_test, y_pred_test,)
0.21752793296089384
print (classification_report (y_pred_test, predictions))
              precision    recall  f1-score   support

           0       0.76      0.01      0.02      6769
           1       0.21      0.99      0.35      1823

    accuracy                           0.22      8592
   macro avg       0.49      0.50      0.18      8592
weighted avg       0.64      0.22      0.09      8592
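Every number in this report can be derived from the Naive Bayes confusion matrix printed above it. The sketch below recomputes accuracy, precision, and recall by hand; the variable names are illustrative, and the matrix layout (rows are true classes, columns are predicted classes) is the scikit-learn convention.

```python
# Confusion matrix copied from the confusion_matrix output:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
nb_matrix = [[67, 6702],
             [21, 1802]]

tn, fp = nb_matrix[0]
fn, tp = nb_matrix[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)  # of all predicted defaults, the share that were real
recall_1 = tp / (tp + fn)     # of all real defaults, the share that were caught
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(round(accuracy, 4))                         # 0.2175
print(round(precision_1, 2), round(recall_1, 2))  # 0.21 0.99
print(round(precision_0, 2), round(recall_0, 2))  # 0.76 0.01
```

The numbers show that the model labels almost every loan as a default: it catches 99% of real defaults but is right only 21% of the time when it predicts one.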
from sklearn.tree import DecisionTreeClassifier
pred_tree = DecisionTreeClassifier(criterion = 'entropy')
pred_tree.fit(X_pred_train, y_pred_train)
DecisionTreeClassifier(criterion='entropy')
prediction = pred_tree.predict (X_pred_test)
prediction
array([0, 0, 0, ..., 0, 0, 0])
accuracy_score ( y_pred_test, prediction)
0.8864059590316573
con_mat = ConfusionMatrix (pred_tree)
con_mat.fit (X_pred_train, y_pred_train)
con_mat.score (X_pred_test, y_pred_test,)
0.8864059590316573
print (classification_report (y_pred_test, prediction))
              precision    recall  f1-score   support

           0       0.93      0.92      0.93      6769
           1       0.72      0.75      0.74      1823

    accuracy                           0.89      8592
   macro avg       0.83      0.84      0.83      8592
weighted avg       0.89      0.89      0.89      8592
from sklearn.ensemble import RandomForestClassifier
random_forest_pred = RandomForestClassifier(n_estimators=40, criterion='entropy', random_state = 0)
random_forest_pred.fit(X_pred_train, y_pred_train)
RandomForestClassifier(criterion='entropy', n_estimators=40, random_state=0)
predictions = random_forest_pred.predict(X_pred_test)
predictions
array([0, 0, 0, ..., 0, 0, 0])
y_pred_test
array([0, 0, 0, ..., 0, 0, 0])
accuracy_score(y_pred_test, predictions)
0.93237895716946
cm = ConfusionMatrix(random_forest_pred)
cm.fit(X_pred_train, y_pred_train)
cm.score(X_pred_test, y_pred_test)
0.93237895716946
print(classification_report(y_pred_test, predictions))
              precision    recall  f1-score   support

           0       0.93      0.99      0.96      6769
           1       0.97      0.71      0.82      1823

    accuracy                           0.93      8592
   macro avg       0.95      0.85      0.89      8592
weighted avg       0.93      0.93      0.93      8592
Comparing the three machine learning algorithms (Naive Bayes, Decision Tree, and Random Forest), Random Forest performs best on this dataset with 93% test accuracy, ahead of the Decision Tree at 89% and Naive Bayes at 22%.
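To put these accuracies in context, it helps to compare them with a trivial baseline that always predicts the majority class. The sketch below uses only the test-set class counts and accuracy values printed earlier; the dictionary names are illustrative.

```python
# Test-set class counts from the "support" column of the reports.
support = {0: 6769, 1: 1823}

# Accuracy values printed by accuracy_score for each model.
accuracy = {
    "Naive Bayes": 0.21752793296089384,
    "Decision Tree": 0.8864059590316573,
    "Random Forest": 0.93237895716946,
}

# Accuracy of a model that always predicts the majority class (no default).
baseline = max(support.values()) / sum(support.values())
best_model = max(accuracy, key=accuracy.get)

print(round(baseline, 3))  # 0.788
print(best_model)          # Random Forest
for name, acc in accuracy.items():
    print(f"{name}: {acc - baseline:+.3f} vs. baseline")
```

Because about 79% of the test loans are non-defaults, accuracy alone flatters every model. The per-class recall in the reports (0.71 for defaults with Random Forest) is a useful complement when judging how well defaults are detected.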